This week’s lab is a musical lab. You’ll be requesting data from the Spotify API and using it to build k-nearest neighbor and decision tree models. In order to use the Spotify you must have a Spotify account. If you don’t have one, sign up for a free one here: https://www.spotify.com/us/signup.

Once you have an account, go to Spotify for developers (https://developer.spotify.com/) and log in. Click the green “Create a Client ID” button to fill out the form to create an app create an app so you can access the API. On your developer dashboard page, click on the new app you just created. On the app’s dashboard page you will find your Client ID just under the header name of your app. Click “Show Client Secret” to access your secondary Client ID. When you do this you’ll be issued a Spotify client ID and client secret key.

You have two options for completing this lab.

Option 1: Classify by users. Build models that predict whether a given song will be in your collection vs. a partner in class. This requires that you were already a Spotify user so you have enough data to work with. You will download your data from the Spotify API and then exchange with another member of class.

Option 2: Classify by genres. Build models that predict which genre a song belongs to. This will use a pre-existing Spotify dataset available from Kaggle.com (https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify)

#LOAD NECESSARY PACKAGES

library(spotifyr) #API interaction
library(tidyverse) #cleaning data, ggplot, etc 
library(tidymodels) #for modeling/statistical analysis
library(rsample) #for splitting data into train / test
library(recipes) #for creating the recipe for ML
#library(skimr) #for data exploration / early summary stats and viz
library(kknn) #for KNN modeling
library(plotly) #for data viz
library(ggpubr) #for data viz
library(here) #for simplifying file path navigation
library(baguette) #for bagging decision trees
library(ranger) # engine for random forests
library(kableExtra) #for creating a formatted table

Client ID and Client Secret are required to create and access token that is required to interact with the API. You can set them as system values so we don’t have to do provide them each time.

#API key stored in token.R, which is not git tracked for security

source(here("token.R"))

This may result in an error: “INVALID_CLIENT: Invalid redirect URI.” This can be resolved by editing the callback settings on your app. Go to your app and click “Edit Settings”. Under redirect URLs paste this: http://localhost:1410/ and click save at the bottom.

Data Preparation

You can use get_my_saved_tracks() to request all your liked tracks. It would be good if you had at least 150-200 liked tracks so the model has enough data to work with. If you don’t have enough liked tracks, you can instead use get_my_recently_played(), and in that case grab at least 500 recently played tracks if you can.

The Spotify API returns a dataframe of tracks and associated attributes. However, it will only return up to 50 (or 20) tracks at a time, so you will have to make multiple requests. Use a function to combine all your requests in one call.

#GETTING A DATAFRAME OF LIKED SONGS

#In the get_my_saved_tracks() function, you can use offsets() to specify the index of the first track to return. 
offsets = seq(from = 0, to = 1000, by = 50)

#initializing an empty matrix 
liked_tracks <- data.frame(matrix(nrow = 0, ncol = 30))

#Function to get my 1050 most recently liked tracks 
for (i in seq_along(offsets)) {  
  my_tracks = get_my_saved_tracks(limit = 50, 
                                     offset = offsets[i])
  df_temp = as.data.frame(my_tracks) #creating a temporary data frame to store the 50 liked tracks from a given run
  liked_tracks <- rbind(liked_tracks, df_temp) #binding the temporary data frame to my liked tracks data frame. 
}
## Auto-refreshing stale OAuth token.

Once you have your tracks, familiarize yourself with this initial dataframe. You’ll need to request some additional information for the analysis. If you give the API a list of track IDs using get_track_audio_features(), it will return an audio features dataframe of all the tracks and some attributes of them.

#obtain a list of the track IDs 
ids <- liked_tracks$track.id

#the ids argument in `get_track_audio_features()` can only take 100 IDs at a time, so I'm splitting them into 100 track groupings
ids_split <- split(ids, ceiling(seq_along(ids) / 100))

#create an empty list
my_tracks_audio_feats_list <- list()

#Iterating the `get_track_audio_features()`function over each ID split and storing the resulting data in a list.
for (i in 1:length(ids_split)) {
  my_tracks_audio_feats_list[[i]] <- get_track_audio_features(ids = ids_split[[i]])
}

#Combine the list of data frames into a single data frame.
my_tracks_audio_feats <- do.call(rbind, my_tracks_audio_feats_list)

These track audio features are the predictors we are interested in, but this dataframe doesn’t have the actual names of the tracks. Append the ‘track.name’ column from your favorite tracks database.

#selecting the track id and track name from liked tracks so I can use left_join and only add the track name to the audio features data set
liked_tracks_join <- liked_tracks %>%
  select(track.id, track.name, track.artists) %>%
  mutate(primary_artist = unlist(lapply(liked_tracks$track.artists, function(x) x$name[1]))) %>%
  select(-track.artists)

#add the track name and artist to the audio features data frame 
liked_tracks_full <- left_join(my_tracks_audio_feats, liked_tracks_join, by = c("id" = "track.id"))

#save this data frame as a CVS so it can be shared with others
write_csv(liked_tracks_full, "lewis_liked_tracks.csv")

Find a class mate whose data you would like to use. Add your partner’s data to your dataset. Create a new column that will contain the outcome variable that you will try to predict. This variable should contain two values that represent if the track came from your data set or your partner’s.

# reading in Elke's liked tracks. Add a column to make it clear which tracks are Elke's  
elke_liked_tracks <- read_csv(here("elke_liked_tracks.csv")) %>%
  mutate(listener = "Elke")

#add a column named "listener" to make it clear that these are my tracks 
lewis_liked_tracks <- liked_tracks_full %>%
  mutate(listener = "Lewis")

# Combine all of our liked tracks into one data frame 
all_tracks <- rbind(lewis_liked_tracks, elke_liked_tracks) %>%
  select(-(type:analysis_url))

Data Exploration

Let’s take a look at your data. Do some exploratory summary stats and visualization.

For example: What are the most danceable tracks in your dataset? What are some differences in the data between users (Option 1) or genres (Option 2)?

Lists of songs

# names(lewis_liked_tracks)  #for checking out column names

#my most danceable songs 
lewis_liked_tracks %>%
  select(danceability, track.name, primary_artist) %>%
  arrange(-danceability) %>%
  head(5)
## # A tibble: 5 × 3
##   danceability track.name                           primary_artist     
##          <dbl> <chr>                                <chr>              
## 1        0.984 Conceited                            Flo Milli          
## 2        0.975 CAN'T TOUCH THIS                     BIA                
## 3        0.971 We Not Humping - Remix               Monaleo            
## 4        0.964 Slumber Party (feat. Princess Nokia) Ashnikko           
## 5        0.957 Cry Baby (feat. DaBaby)              Megan Thee Stallion
#my least danceable songs
lewis_liked_tracks %>%
  select(danceability, track.name, primary_artist) %>%
  arrange(danceability) %>%
  head(5)
## # A tibble: 5 × 3
##   danceability track.name          primary_artist  
##          <dbl> <chr>               <chr>           
## 1        0.164 Crying All the Time Alexandra Savior
## 2        0.185 Appointments        Julien Baker    
## 3        0.205 Get Free            Lana Del Rey    
## 4        0.215 Funeral             Phoebe Bridgers 
## 5        0.218 Alone Again         The Weeknd
#my most "speechy" songs
lewis_liked_tracks %>%
  select(speechiness, track.name, primary_artist) %>%
  arrange(-speechiness) %>%
  head(5)
## # A tibble: 5 × 3
##   speechiness track.name                                            primary_ar…¹
##         <dbl> <chr>                                                 <chr>       
## 1       0.781 Yoga                                                  645AR       
## 2       0.635 Big Black Truck (with JID)                            Dreamville  
## 3       0.595 Ghost Of Soulja Slim                                  Jay Electro…
## 4       0.59  That's All I Have                                     Lil Wayne   
## 5       0.583 Ghetto Gods Freestyle (with EARTHGANG feat. 2 Chainz) Dreamville  
## # … with abbreviated variable name ¹​primary_artist
#my least "speechy" songs
lewis_liked_tracks %>%
  select(speechiness, track.name, primary_artist) %>%
  arrange(speechiness) %>%
  head(5)
## # A tibble: 5 × 3
##   speechiness track.name    primary_artist  
##         <dbl> <chr>         <chr>           
## 1      0.0241 Me & My Dog   boygenius       
## 2      0.0242 Dreams Tonite Alvvays         
## 3      0.0251 Vessels       Julien Baker    
## 4      0.0254 The Archer    Alexandra Savior
## 5      0.0254 Notion        Tash Sultana

Distribution Plots

#Danceability comparison
dance_plot <- ggplot(all_tracks, aes(x = danceability, fill = listener,
                    text = paste(listener))) +
  geom_density(alpha=0.6, color=NA) +
  scale_fill_manual(values=c("#b0484f", "#4436d9"))+
  labs(x="Danceability", y="Density") +
  guides(fill=guide_legend(title="Listener"))+
  theme_minimal() +
  ggtitle("Distribution of Danceability Data")


#speechiness comparison
speech_plot <- ggplot(all_tracks, aes(x = speechiness, fill = listener,
                    text = paste(listener))) +
  geom_density(alpha=0.6, color=NA) +
  scale_fill_manual(values=c("#b0484f", "#4436d9"))+
  labs(x="Speechiness", y="Density") +
  guides(fill=guide_legend(title="Listener"))+
  theme_minimal() +
  ggtitle("Distribution of Speechiness Data")


#acousticness comparison
acoustic_plot <- ggplot(all_tracks, aes(x = acousticness, fill = listener,
                    text = paste(listener))) +
  geom_density(alpha=0.6, color=NA) +
  scale_fill_manual(values=c("#b0484f", "#4436d9"))+
  labs(x="Acousticness", y="Density") +
  guides(fill=guide_legend(title="Listener"))+
  theme_minimal() +
  ggtitle("Distribution of Acousticness Data")

#energy comparison 
energy_plot <- ggplot(all_tracks, aes(x = energy, fill = listener,
                    text = paste(listener))) +
  geom_density(alpha=0.6, color=NA) +
  scale_fill_manual(values=c("#b0484f", "#4436d9"))+
  labs(x="Energy", y="Density") +
  guides(fill=guide_legend(title="Listener"))+
  theme_minimal() +
  ggtitle("Distribution of Energy Data")

ggarrange(dance_plot, speech_plot, acoustic_plot, energy_plot, ncol=2, nrow=2, common.legend = TRUE, legend="bottom")

#My music appears to be higher energy, more danceable, and more "speechy," while Elke's music tends to be more acoustic. 
#plotting valence and energy to get a sense for the moods of our liked songs 
ggplot(data = all_tracks, aes(x = valence, y = energy, color = listener)) +
  geom_point(alpha = 0.5) +
  geom_vline(xintercept = 0.5) +
  geom_hline(yintercept = 0.5) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
  annotate('text', 0.25 / 2, 0.95, label = "Turbulent/Angry", fontface =
             "bold") +
  annotate('text', 1.75 / 2, 0.95, label = "Happy/Joyful", fontface = "bold") +
  annotate('text', 1.75 / 2, 0.05, label = "Chill/Peaceful", fontface =
             "bold") +
  annotate('text', 0.25 / 2, 0.05, label = "Sad/Depressing", fontface =
             "bold") +
  theme_minimal() +
  labs(x = "Valence", 
       y = "Energy",
       title = "Plotting songs based on their positivity and energy level",
       subtitle = "Elke and I don't have many songs in the Chill/Peaceful quadrant.")

## this plotly allows you to hover over points to see the track and the corresponding artist
plot_ly(data = all_tracks,
        x = ~valence, y = ~energy, color = ~listener, type = "scatter", mode = "markers",
        text = paste("Track Name:", all_tracks$track.name, "<br>",
                     "Primary Artist:", all_tracks$primary_artist, "<br>",
                     "Valence:", all_tracks$valence, "<br>", 
                     "Energy:", all_tracks$energy, "<br>",
                     "Listener:", all_tracks$listener)) %>%
  layout(xaxis = list(title = "Valence"),
         yaxis = list(title = "Energy"),
         hovermode = "closest",
         title = "Track Valence vs Energy")

Statistical comparison of variables

#some quick t-tests comparing my music to Elke's 

#danceability 
t.test(lewis_liked_tracks$danceability, elke_liked_tracks$danceability, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lewis_liked_tracks$danceability and elke_liked_tracks$danceability
## t = 10.437, df = 1130.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06852773 0.10025837
## sample estimates:
## mean of x mean of y 
##  0.651099  0.566706
#speechiness
t.test(lewis_liked_tracks$speechiness, elke_liked_tracks$speechiness, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lewis_liked_tracks$speechiness and elke_liked_tracks$speechiness
## t = 21.06, df = 1500.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.09176451 0.11061422
## sample estimates:
## mean of x mean of y 
## 0.1561088 0.0549194
#acousticness
t.test(lewis_liked_tracks$acousticness, elke_liked_tracks$acousticness, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lewis_liked_tracks$acousticness and elke_liked_tracks$acousticness
## t = -7.2932, df = 910.95, p-value = 6.572e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.16148987 -0.09300632
## sample estimates:
## mean of x mean of y 
## 0.3008656 0.4281137
#energy
t.test(lewis_liked_tracks$energy, elke_liked_tracks$energy, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lewis_liked_tracks$energy and elke_liked_tracks$energy
## t = 4.7455, df = 861.4, p-value = 2.434e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03096822 0.07465271
## sample estimates:
## mean of x mean of y 
## 0.5765427 0.5237322
#Each t.test was significant at the 0.001 significance level, clearly suggesting that my music is different than Elke's across the variables tested. 

Modeling

Create two models, a k-nearest neighbor model and a decision tree model that predict whether a track belongs to you or your partner’s collection. Then validate and compare the performance of the two models you have created.

Make sure to use appropriate resampling to select the best version of each algorithm to compare and some appropriate visualization of your results.

#prepare all_tracks for modeling by removing songs with duplicate values (tough to predict if both Elke and I liked a song), making nominal data a factor, and removing variables that don't make sense to include in the model
all_tracks_modeling <- all_tracks[!duplicated(all_tracks$track.name, fromLast = TRUE) & !duplicated(all_tracks$track.name), ] %>%  
  mutate_if(is.ordered, .funs = factor, ordered = F) %>%
  select(-track.name, -primary_artist) %>%
  mutate(listener = as.factor(listener))
#splitting the data
set.seed(123)
#initial split of data ~ we're going with 70/30 because the sample size isn't super large for testing
tracks_split <- initial_split(all_tracks_modeling, prop = .7)
tracks_train <- training(tracks_split)
tracks_test <- testing(tracks_split)

#pre-processing
# We need to create a recipe and do the pre-processing by converting dummy coding the nominal variables and normalizing the numeric variables.
tracks_recipe <- recipe(listener ~ ., data = tracks_train) %>%
  #step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% #normalize numeric to make sure scale is okay
  prep()

KNN Model

# Define our KNN model with tuning
knn_spec_tune <- nearest_neighbor(neighbor = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")

# Check the model
knn_spec_tune
## K-Nearest Neighbor Model Specification (classification)
## 
## Main Arguments:
##   neighbors = tune()
## 
## Computational engine: kknn
# Define a new workflow
wf_knn_tune <- workflow() %>%
add_model(knn_spec_tune) %>%
add_recipe(tracks_recipe)

# 10-fold CV on the training dataset
set.seed(123)
cv_folds <- tracks_train %>%
  vfold_cv(v = 10)


# Fit the workflow on our predefined folds and hyperparameters
fit_knn_cv <- wf_knn_tune %>%
  tune_grid(
    cv_folds, 
    grid = data.frame(neighbors = c(1, 5, seq(10, 150, 10))) # try with 1 nearest neighbor, try with 5, 10, 20, 30, ..., 100
  )
    
# Check the performance with collect_metrics()
print(n=32, fit_knn_cv %>% collect_metrics()) 
## # A tibble: 34 × 7
##    neighbors .metric  .estimator  mean     n std_err .config              
##        <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1         1 accuracy binary     0.717    10 0.0104  Preprocessor1_Model01
##  2         1 roc_auc  binary     0.645    10 0.0118  Preprocessor1_Model01
##  3         5 accuracy binary     0.735    10 0.0105  Preprocessor1_Model02
##  4         5 roc_auc  binary     0.758    10 0.0133  Preprocessor1_Model02
##  5        10 accuracy binary     0.755    10 0.00761 Preprocessor1_Model03
##  6        10 roc_auc  binary     0.777    10 0.0108  Preprocessor1_Model03
##  7        20 accuracy binary     0.752    10 0.0109  Preprocessor1_Model04
##  8        20 roc_auc  binary     0.788    10 0.0110  Preprocessor1_Model04
##  9        30 accuracy binary     0.758    10 0.0130  Preprocessor1_Model05
## 10        30 roc_auc  binary     0.796    10 0.0106  Preprocessor1_Model05
## 11        40 accuracy binary     0.759    10 0.0150  Preprocessor1_Model06
## 12        40 roc_auc  binary     0.797    10 0.0112  Preprocessor1_Model06
## 13        50 accuracy binary     0.762    10 0.0150  Preprocessor1_Model07
## 14        50 roc_auc  binary     0.799    10 0.0113  Preprocessor1_Model07
## 15        60 accuracy binary     0.761    10 0.0150  Preprocessor1_Model08
## 16        60 roc_auc  binary     0.802    10 0.0120  Preprocessor1_Model08
## 17        70 accuracy binary     0.761    10 0.0172  Preprocessor1_Model09
## 18        70 roc_auc  binary     0.803    10 0.0121  Preprocessor1_Model09
## 19        80 accuracy binary     0.761    10 0.0163  Preprocessor1_Model10
## 20        80 roc_auc  binary     0.803    10 0.0120  Preprocessor1_Model10
## 21        90 accuracy binary     0.761    10 0.0152  Preprocessor1_Model11
## 22        90 roc_auc  binary     0.803    10 0.0119  Preprocessor1_Model11
## 23       100 accuracy binary     0.762    10 0.0150  Preprocessor1_Model12
## 24       100 roc_auc  binary     0.804    10 0.0117  Preprocessor1_Model12
## 25       110 accuracy binary     0.760    10 0.0156  Preprocessor1_Model13
## 26       110 roc_auc  binary     0.804    10 0.0118  Preprocessor1_Model13
## 27       120 accuracy binary     0.759    10 0.0141  Preprocessor1_Model14
## 28       120 roc_auc  binary     0.805    10 0.0117  Preprocessor1_Model14
## 29       130 accuracy binary     0.759    10 0.0144  Preprocessor1_Model15
## 30       130 roc_auc  binary     0.805    10 0.0119  Preprocessor1_Model15
## 31       140 accuracy binary     0.760    10 0.0159  Preprocessor1_Model16
## 32       140 roc_auc  binary     0.805    10 0.0119  Preprocessor1_Model16
## # … with 2 more rows
## # ℹ Use `print(n = ...)` to see more rows
# The final workflow for our KNN model
final_wf <-
  wf_knn_tune %>%
  finalize_workflow(select_best(fit_knn_cv))
## Warning: No value of `metric` was given; metric 'roc_auc' will be used.
# Fitting our final workflow
final_fit <- final_wf %>%
  fit(data = tracks_train)

#generating predictions using the model on the test data
tracks_pred <- final_fit %>% predict(new_data = tracks_test)

tracks_pred %>% head()
## # A tibble: 6 × 1
##   .pred_class
##   <fct>      
## 1 Lewis      
## 2 Lewis      
## 3 Lewis      
## 4 Lewis      
## 5 Lewis      
## 6 Elke
# Write over 'final_fit' with this last_fit() approach
final_fit <- final_wf %>% last_fit(tracks_split)

# Collect metrics on the test data!
final_fit %>% collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.751 Preprocessor1_Model1
## 2 roc_auc  binary         0.799 Preprocessor1_Model1

Decision Tree Approach

#new spec, tell the model that we are tuning hyperparams
tree_spec_tune <- decision_tree(
  cost_complexity = tune(),   #tune() asks R to try a bunch of different parameters. 
  tree_depth = tune(),
  min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

#create a grid of options for tuning purposes so we can identify the optimal value of hyperparameters 

tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5) #grid_regular shows grid of different input tuning options while levels says how many combinations we should try. 

tree_grid  #125 options 
## # A tibble: 125 × 3
##    cost_complexity tree_depth min_n
##              <dbl>      <int> <int>
##  1    0.0000000001          1     2
##  2    0.0000000178          1     2
##  3    0.00000316            1     2
##  4    0.000562              1     2
##  5    0.1                   1     2
##  6    0.0000000001          4     2
##  7    0.0000000178          4     2
##  8    0.00000316            4     2
##  9    0.000562              4     2
## 10    0.1                   4     2
## # … with 115 more rows
## # ℹ Use `print(n = ...)` to see more rows
wf_tree_tune <- workflow() %>%
  add_recipe(tracks_recipe) %>%
  add_model(tree_spec_tune)
    #when you add recipe into workflow, it automatically will prep and bake when necessary. 
  

#workflow pulls together the specification. and then we can fit on the workflow. you could fit the wf_tree_tune 

#set up k-fold cv. This can be used for all the algorithms. I switch to 5 fold CV instead of 10 (used in KNN above) for computation speed. 
listener_cv <- tracks_train %>% vfold_cv(v=5) #creating 5 folds cross validation 
listener_cv
## #  5-fold cross-validation 
## # A tibble: 5 × 2
##   splits            id   
##   <list>            <chr>
## 1 <split [771/193]> Fold1
## 2 <split [771/193]> Fold2
## 3 <split [771/193]> Fold3
## 4 <split [771/193]> Fold4
## 5 <split [772/192]> Fold5
doParallel::registerDoParallel() #build trees in parallel
#200s

#get the results of the tuning 
tree_rs <- tune_grid(
  wf_tree_tune,
  listener ~ ., #function 
  resamples = listener_cv, #resamples to use
  grid = tree_grid, #grid to try
  metrics = metric_set(accuracy) #how to assess which combinations are best 
)
## Warning: The `...` are not used in this function but one or more objects were
## passed: ''
tree_rs
## # Tuning results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits            id    .metrics           .notes          
##   <list>            <chr> <list>             <list>          
## 1 <split [771/193]> Fold1 <tibble [125 × 7]> <tibble [0 × 3]>
## 2 <split [771/193]> Fold2 <tibble [125 × 7]> <tibble [0 × 3]>
## 3 <split [771/193]> Fold3 <tibble [125 × 7]> <tibble [0 × 3]>
## 4 <split [771/193]> Fold4 <tibble [125 × 7]> <tibble [0 × 3]>
## 5 <split [772/192]> Fold5 <tibble [125 × 7]> <tibble [0 × 3]>
#Plot the tuning results to visualize which hyperparameter values work best
autoplot(tree_rs) + theme_light() + labs(title = "Decision Tree Tuning Plot")

show_best(tree_rs)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric  .estim…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>    <chr>    <dbl> <int>   <dbl> <chr>  
## 1    0.0000000001          1     2 accuracy binary   0.716     5 0.00531 Prepro…
## 2    0.0000000178          1     2 accuracy binary   0.716     5 0.00531 Prepro…
## 3    0.00000316            1     2 accuracy binary   0.716     5 0.00531 Prepro…
## 4    0.000562              1     2 accuracy binary   0.716     5 0.00531 Prepro…
## 5    0.1                   1     2 accuracy binary   0.716     5 0.00531 Prepro…
## # … with abbreviated variable name ¹​.estimator
select_best(tree_rs)
## # A tibble: 1 × 4
##   cost_complexity tree_depth min_n .config               
##             <dbl>      <int> <int> <chr>                 
## 1    0.0000000001          1     2 Preprocessor1_Model001
#picking the best model for the final model
final_tree <- finalize_model(tree_spec_tune, select_best(tree_rs))
#Fit the model on the test data 
final_tree_fit <- last_fit(final_tree, listener ~., tracks_split)

final_tree_fit$.predictions
## [[1]]
## # A tibble: 414 × 6
##    .pred_Elke .pred_Lewis  .row .pred_class listener .config             
##         <dbl>       <dbl> <int> <fct>       <fct>    <chr>               
##  1      0.284       0.716     1 Lewis       Lewis    Preprocessor1_Model1
##  2      0.284       0.716     3 Lewis       Lewis    Preprocessor1_Model1
##  3      0.284       0.716     4 Lewis       Lewis    Preprocessor1_Model1
##  4      0.284       0.716     7 Lewis       Lewis    Preprocessor1_Model1
##  5      0.284       0.716    12 Lewis       Lewis    Preprocessor1_Model1
##  6      0.284       0.716    14 Lewis       Lewis    Preprocessor1_Model1
##  7      0.284       0.716    15 Lewis       Lewis    Preprocessor1_Model1
##  8      0.284       0.716    21 Lewis       Lewis    Preprocessor1_Model1
##  9      0.284       0.716    22 Lewis       Lewis    Preprocessor1_Model1
## 10      0.284       0.716    27 Lewis       Lewis    Preprocessor1_Model1
## # … with 404 more rows
## # ℹ Use `print(n = ...)` to see more rows

Bagging Trees

set.seed(123)
bag_spec_tune <- bag_tree(cost_complexity = tune(),
                          tree_depth = tune(),
                          min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("rpart", times = 50) #50 trees in a bag

#create tuning grid for hyperparamters 
bag_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)

#make workflow for bagging
wf_bag_tune <- workflow() %>% 
  add_recipe(tracks_recipe) %>%
  add_model(bag_spec_tune)

doParallel::registerDoParallel() #build trees in parallel
#200s

#find results of the tuning process 
bag_rs <- tune_grid(
  bag_spec_tune,
  listener ~ ., #function 
  resamples = listener_cv, #resamples to use
  grid = bag_grid,
  metrics = metric_set(accuracy) #how to assess which combinations are best 
)

#Plot the tuning results to visualize which hyperparameter values work best
autoplot(bag_rs) + theme_light() + labs(title = "Bagged Decision Tree Tuning Plot")

show_best(bag_rs)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric  .estim…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>    <chr>    <dbl> <int>   <dbl> <chr>  
## 1    0.00000316           11    30 accuracy binary   0.756     5 0.00471 Prepro…
## 2    0.0000000001         15    40 accuracy binary   0.749     5 0.00824 Prepro…
## 3    0.000562              4    11 accuracy binary   0.749     5 0.00620 Prepro…
## 4    0.00000316            8    40 accuracy binary   0.748     5 0.00555 Prepro…
## 5    0.0000000178          8    40 accuracy binary   0.747     5 0.00769 Prepro…
## # … with abbreviated variable name ¹​.estimator
select_best(bag_rs)
## # A tibble: 1 × 4
##   cost_complexity tree_depth min_n .config               
##             <dbl>      <int> <int> <chr>                 
## 1      0.00000316         11    30 Preprocessor1_Model093
#picking the best model for the final model
final_bag <- finalize_model(bag_spec_tune, select_best(bag_rs))
#fitting the best model on the test data
final_bag_fit <- last_fit(final_bag, listener ~., tracks_split)

#checking out the predictions
final_bag_fit$.predictions
## [[1]]
## # A tibble: 414 × 6
##    .pred_Elke .pred_Lewis  .row .pred_class listener .config             
##         <dbl>       <dbl> <int> <fct>       <fct>    <chr>               
##  1     0.307        0.693     1 Lewis       Lewis    Preprocessor1_Model1
##  2     0.165        0.835     3 Lewis       Lewis    Preprocessor1_Model1
##  3     0.270        0.730     4 Lewis       Lewis    Preprocessor1_Model1
##  4     0.182        0.818     7 Lewis       Lewis    Preprocessor1_Model1
##  5     0.0777       0.922    12 Lewis       Lewis    Preprocessor1_Model1
##  6     0.593        0.407    14 Elke        Lewis    Preprocessor1_Model1
##  7     0.566        0.434    15 Elke        Lewis    Preprocessor1_Model1
##  8     0.623        0.377    21 Elke        Lewis    Preprocessor1_Model1
##  9     0.535        0.465    22 Elke        Lewis    Preprocessor1_Model1
## 10     0.204        0.796    27 Lewis       Lewis    Preprocessor1_Model1
## # … with 404 more rows
## # ℹ Use `print(n = ...)` to see more rows

Random Forests

  1. random forest
    • rand_forest()
    • m_try() is the new hyperparameter of interest for this type of model. Make sure to include it in your tuning process

Go through the modeling process for each model: * Preprocessing. You can use the same recipe for all the models you create. * Resampling. Make sure to use appropriate resampling to select the best version created by each algorithm. * Tuning. Find the best values for each hyperparameter (within a reasonable range).

Compare the performance of the four final models you have created.

Use appropriate performance evaluation metric(s) for this classification task. A table would be a good way to display your comparison. Use at least one visualization illustrating your model results.

set.seed(123)

#setting up the random forest specification
forest_spec_tune <- 
  rand_forest(min_n = tune(),
  mtry = tune(),
  trees =  140) %>% # 14 predictors * 10, as suggested by "Hands on Machine Learning in R" by Bradley Boehmke & Brandon Greenwell. Tuning this value led to an extremely long computation time.
  set_engine("ranger") %>%
  set_mode("classification")

#create grid for tuning min_n and the mtry value
forest_grid <- grid_regular(mtry(c(1,13)), min_n(), levels = 5)

#create a workflow for the random forests
wf_forest_tune <- workflow() %>% 
  add_recipe(tracks_recipe) %>%
  add_model(forest_spec_tune)

doParallel::registerDoParallel() #build forest in parallel

#get results of the tuning to try and find optimal model
forest_rs <- tune_grid(
  forest_spec_tune,
  listener ~ ., 
  resamples = listener_cv, #resamples to use
  grid = forest_grid,
  metrics = metric_set(accuracy) #how to assess which combinations are best 
)

#Plot the tuning results to visualize which hyperparameter values work best
autoplot(forest_rs) + theme_light() + labs(title = "Random Forest Tuning Plot")

show_best(bag_rs)
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric  .estim…¹  mean     n std_err .config
##             <dbl>      <int> <int> <chr>    <chr>    <dbl> <int>   <dbl> <chr>  
## 1    0.00000316           11    30 accuracy binary   0.756     5 0.00471 Prepro…
## 2    0.0000000001         15    40 accuracy binary   0.749     5 0.00824 Prepro…
## 3    0.000562              4    11 accuracy binary   0.749     5 0.00620 Prepro…
## 4    0.00000316            8    40 accuracy binary   0.748     5 0.00555 Prepro…
## 5    0.0000000178          8    40 accuracy binary   0.747     5 0.00769 Prepro…
## # … with abbreviated variable name ¹​.estimator
select_best(bag_rs)
## # A tibble: 1 × 4
##   cost_complexity tree_depth min_n .config               
##             <dbl>      <int> <int> <chr>                 
## 1      0.00000316         11    30 Preprocessor1_Model093
#picking the best model for the final model
final_forest <- finalize_model(forest_spec_tune, select_best(forest_rs))
#fitting the optimal model on the test data
final_forest_fit <- last_fit(final_forest, listener ~., tracks_split)

#check out the predictions and metrics 
final_forest_fit$.predictions
## [[1]]
## # A tibble: 414 × 6
##    .pred_Elke .pred_Lewis  .row .pred_class listener .config             
##         <dbl>       <dbl> <int> <fct>       <fct>    <chr>               
##  1     0.321        0.679     1 Lewis       Lewis    Preprocessor1_Model1
##  2     0.226        0.774     3 Lewis       Lewis    Preprocessor1_Model1
##  3     0.373        0.627     4 Lewis       Lewis    Preprocessor1_Model1
##  4     0.298        0.702     7 Lewis       Lewis    Preprocessor1_Model1
##  5     0.0667       0.933    12 Lewis       Lewis    Preprocessor1_Model1
##  6     0.621        0.379    14 Elke        Lewis    Preprocessor1_Model1
##  7     0.580        0.420    15 Elke        Lewis    Preprocessor1_Model1
##  8     0.610        0.390    21 Elke        Lewis    Preprocessor1_Model1
##  9     0.555        0.445    22 Elke        Lewis    Preprocessor1_Model1
## 10     0.149        0.851    27 Lewis       Lewis    Preprocessor1_Model1
## # … with 404 more rows
## # ℹ Use `print(n = ...)` to see more rows
final_forest_fit$.metrics
## [[1]]
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.773 Preprocessor1_Model1
## 2 roc_auc  binary         0.834 Preprocessor1_Model1

Comparing each of the models

#cleaning model metrics to create a table with each model's performance metrics 

forest_results_table <- final_forest_fit %>% collect_metrics() %>%
  select(.metric, .estimate) %>%
  mutate(method = "Random Forest")

bagging_results_table <- final_bag_fit %>% collect_metrics() %>%
  select(.metric, .estimate) %>%
  mutate(method = "Bagging")

decision_tree_results_table <- final_tree_fit %>% collect_metrics() %>%
  select(.metric, .estimate) %>%
  mutate(method = "Decision Tree")

knn_results_table <- final_fit %>% collect_metrics() %>%
  select(.metric, .estimate) %>%
  mutate(method = "KNN")

majority_class_results_table <- data.frame (.metric  = c("accuracy", "roc_auc"),
                  .estimate = c(nrow(lewis_liked_tracks) / (nrow(lewis_liked_tracks) + nrow(elke_liked_tracks)), .5),
                  method = c("Dummy Classifier", "Dummy Classifier")
                  )

full_results <- bind_rows(majority_class_results_table,
                          knn_results_table, 
                          decision_tree_results_table,
                          bagging_results_table,
                          forest_results_table) %>%
  select(method, .estimate, .metric)

#table of results 
full_results %>%
  kbl(caption = "Table 1. Classification Methods and their Respective Accuracies") %>%
  kable_classic(full_width = T, html_font = "Cambria")
Table 1. Classification Methods and their Respective Accuracies
method .estimate .metric
Dummy Classifier 0.6774194 accuracy
Dummy Classifier 0.5000000 roc_auc
KNN 0.7512077 accuracy
KNN 0.7992914 roc_auc
Decision Tree 0.7101449 accuracy
Decision Tree 0.5000000 roc_auc
Bagging 0.7632850 accuracy
Bagging 0.8160006 roc_auc
Random Forest 0.7729469 accuracy
Random Forest 0.8344104 roc_auc
#accuracy plot

ggplot(data = full_results, aes(x = method, y = .estimate, fill = .metric)) +
  geom_bar(stat='identity', position='dodge') + 
  theme_minimal() +
  scale_fill_manual(values=c("#9fd182", "#7798c9")) +
  labs(y = "Accuracy Estimate",
       x = "Classification Method",
       fill = "Accuracy Metric", 
       title = "Random Forest Classification Performed the Best Across Both Accuracy Metrics")